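We study task-agnostic continual reinforcement learning (TACRL). TACRL combines the difficulty of partially observable RL (a consequence of task agnosticism) with the difficulty of continual learning (CL), i.e., learning on a non-stationary sequence of tasks. We compare TACRL methods with the soft upper bounds prescribed by previous literature: multi-task learning (MTL) methods, which do not have to deal with non-stationary data distributions, and task-aware methods, which are allowed to operate under full observability. We consider a previously unexplored baseline for TACRL, replay-based recurrent RL (3RL), in which we augment an RL algorithm with recurrent mechanisms to mitigate partial observability and with experience replay mechanisms to mitigate catastrophic forgetting in CL. Studying empirical performance on a sequence of RL tasks, we find surprising cases in which 3RL matches and surpasses the MTL and task-aware soft upper bounds. We lay out hypotheses that could explain this inflection point in continual and task-agnostic learning research. Our hypotheses are tested empirically on continuous control tasks through a large-scale study of Meta-World, a popular multi-task and continual learning benchmark. By analyzing different training statistics, including gradient conflict, we find evidence that 3RL's outperformance stems from its ability to quickly infer how new tasks relate to previous ones, enabling forward transfer.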
Translated by Google Translate
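The TACRL abstract above mentions gradient conflict as one of the analyzed training statistics but does not say how it is computed. A common convention, assumed here, is to call two task gradients conflicting when their cosine similarity is negative. The helper names `cosine_similarity` and `conflicts` are hypothetical, and the gradients are toy values rather than results from the paper.

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: a zero gradient is treated as non-conflicting
    return dot / (norm1 * norm2)

def conflicts(g1, g2):
    """Assumed definition: gradients conflict when they point in opposing directions."""
    return cosine_similarity(g1, g2) < 0.0

# Toy gradients for three hypothetical tasks.
grad_task_a = [1.0, 0.0]
grad_task_b = [-1.0, 0.5]
grad_task_c = [0.5, 0.5]

print(conflicts(grad_task_a, grad_task_b))  # True
print(conflicts(grad_task_a, grad_task_c))  # False
```

Under this definition, a low rate of conflict between a new task's gradient and earlier tasks' gradients would be consistent with the forward transfer the abstract describes.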
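Modularity is a compelling solution to continual learning (CL), the problem of modeling sequences of related tasks. Learning and then composing modules to solve different tasks provides an abstraction for addressing the principal challenges of CL, including catastrophic forgetting, backward and forward transfer across tasks, and sub-linear model growth. We introduce local module composition (LMC), an approach to modular CL in which each module is provided with a local structural component that estimates the module's relevance to the input. Dynamic module composition is performed based on local relevance scores. We demonstrate that agnosticity to task identities (IDs) arises from (local) structural learning that is module-specific, as opposed to the task- and/or model-specific structural learning of previous works, making LMC applicable to more CL settings than previous works. In addition, LMC tracks statistics about the input distribution and adds new modules when outlier samples are detected. In the first set of experiments, LMC compares favorably with existing methods on the recent Continual Transfer-learning Benchmark without requiring task identities. In another study, we show that the locality of structural learning allows LMC to interpolate to related but unseen (OOD) tasks, and to compose modular networks trained independently on different task sequences without any fine-tuning. Finally, in search of LMC's limitations, we study it on more challenging sequences of 30 and 100 tasks, demonstrating that local module selection becomes more challenging in the presence of a large number of candidate modules. In this setting, the best-performing LMC spawns fewer modules than an oracle-based baseline, but it reaches a lower overall accuracy. The codebase is available at https://github.com/oleksost/lmc.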
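The LMC abstract above describes two mechanisms: choosing modules by a local relevance score, and adding a module when an input looks like an outlier. The sketch below is a heavily simplified illustration under stated assumptions, not the authors' method. Inputs are scalars, relevance is the negative distance to a running mean, and `Module`, `route`, and `outlier_threshold` are hypothetical names.

```python
class Module:
    """Hypothetical module whose local relevance estimator is a running input mean."""

    def __init__(self, first_input):
        self.mean = first_input
        self.count = 1

    def relevance(self, x):
        # Higher is more relevant: negative distance to the inputs seen so far.
        return -abs(x - self.mean)

    def update(self, x):
        # Incremental update of the running mean of routed inputs.
        self.count += 1
        self.mean += (x - self.mean) / self.count

def route(modules, x, outlier_threshold=2.0):
    """Return the most relevant module, or add a new one if x looks like an outlier."""
    if modules:
        best = max(modules, key=lambda m: m.relevance(x))
        if -best.relevance(x) <= outlier_threshold:
            best.update(x)
            return best
    new_module = Module(x)
    modules.append(new_module)
    return new_module

modules = []
for x in [0.0, 0.5, 10.0, 9.5, 0.2]:
    route(modules, x)

print(len(modules))  # 2
```

Each module scores an input using only its own statistics, so no task identity is needed at routing time. That is the property the abstract attributes to local structural learning.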
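The field of continual learning (CL) seeks to develop algorithms that accumulate knowledge and skills over time through interaction with non-stationary environments. In practice, a plethora of evaluation procedures (settings) and algorithmic solutions (methods) exist, each with its own potentially disjoint set of assumptions. This variety makes it difficult to measure progress in CL. We propose a taxonomy of settings, in which each setting is described as a set of assumptions. A tree-shaped hierarchy emerges from this view, in which more general settings become the parents of those with more restrictive assumptions. This makes it possible to use inheritance to share and reuse research, since developing a method for a given setting also makes it directly applicable to any of its children. We instantiate this idea as a publicly available software framework called Sequoia, which features a wide variety of settings from both the Continual Supervised Learning (CSL) and Continual Reinforcement Learning (CRL) domains. In addition to more specialized methods from external libraries, Sequoia also includes a growing suite of methods that are easy to extend and customize. We hope that this new paradigm and its first implementation can help unify and accelerate research in CL. You can help us grow the tree by visiting github.com/lebrice/squia.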
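The inheritance idea in the Sequoia abstract above can be illustrated with ordinary class inheritance: a method written for a setting applies to every setting that descends from it. The sketch below is not Sequoia's API. All class names and the `is_applicable` helper are hypothetical and exist only for illustration.

```python
class Setting:
    """Most general setting: makes the fewest assumptions."""

class ContinualRLSetting(Setting):
    """Hypothetical child: additionally assumes an RL environment."""

class TaskIncrementalRLSetting(ContinualRLSetting):
    """Hypothetical grandchild: additionally assumes task labels are available."""

class Method:
    """Base class for methods; each declares the setting it was developed for."""
    target_setting = Setting

    def is_applicable(self, setting):
        # A method built for a setting also applies to all of its descendants,
        # because descendants only add assumptions.
        return issubclass(setting, self.target_setting)

class RecurrentAgent(Method):
    """Hypothetical method developed for the continual RL setting."""
    target_setting = ContinualRLSetting

agent = RecurrentAgent()
print(agent.is_applicable(TaskIncrementalRLSetting))  # True
print(agent.is_applicable(Setting))                   # False
```

The method is rejected for the more general `Setting` because it relies on an assumption that the parent does not guarantee.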
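Classical machine learning algorithms often assume that the data are drawn i.i.d. from a stationary probability distribution. Recently, continual learning has emerged as a rapidly growing area of machine learning in which this assumption is relaxed: the data distribution is non-stationary and changes over time. This paper represents the state of the data distribution by a context variable $c$. A drift in $c$ leads to a drift in the data distribution. A context drift may change the target distribution, the input distribution, or both. Moreover, distribution drifts may be abrupt or gradual. In continual learning, context drifts may interfere with the learning process and erase previously learned knowledge. Continual learning algorithms must therefore include specialized mechanisms to deal with such drifts. In this paper, we aim to identify and categorize different types of context drifts and potential assumptions about them, to better characterize various continual learning scenarios. Moreover, we propose to use the distribution drift framework to provide more precise definitions of several terms commonly used in the continual learning field.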
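The distinction between abrupt and gradual drift in the abstract above can be made concrete with a toy data stream. The sketch below is an illustration under stated assumptions: the context variable $c$ moves from 0 to 1, a gradual drift is a linear ramp, and only the input distribution (a Gaussian mean) depends on $c$. The function names and default values are hypothetical.

```python
import random

def context(t, kind, start=50, duration=20):
    """Value of the context variable c at step t (0.0 before the drift, 1.0 after)."""
    if kind == "abrupt":
        return 0.0 if t < start else 1.0
    if kind == "gradual":
        # Linear ramp from 0 to 1 over `duration` steps, clipped to [0, 1].
        return min(1.0, max(0.0, (t - start) / duration))
    raise ValueError(f"unknown drift kind: {kind}")

def sample_input(t, kind, rng):
    """Draw x from a Gaussian whose mean shifts with c (an input-distribution drift)."""
    c = context(t, kind)
    return rng.gauss(3.0 * c, 1.0)

rng = random.Random(0)
stream = [sample_input(t, "gradual", rng) for t in range(100)]

print(context(49, "abrupt"), context(50, "abrupt"))  # 0.0 1.0
print(context(60, "gradual"))                        # 0.5
```

A target-distribution drift could be sketched the same way by making the label rule, rather than the input mean, depend on `context(t, kind)`.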
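Given the importance of deploying more reliable machine learning systems, the explainability of machine learning models has received considerable attention within the research community. In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction, providing details about the model's decision-making. Current methods tend to generate trivial counterfactuals about a model's decisions, as they often suggest exaggerating or removing the presence of the attribute being classified. For machine learning practitioners, these types of counterfactuals offer little value, since they provide no new information about undesired model or data biases. In this work, we identify the problem of trivial counterfactual generation and propose DiVE to alleviate it. DiVE learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss to uncover multiple valuable explanations of the model's prediction. Further, we introduce a mechanism to prevent the model from producing trivial explanations. Experiments on CelebA and Synbols demonstrate that our model improves the success rate of producing high-quality, valuable explanations compared with previous state-of-the-art methods. The code is available at https://github.com/elementai/beyond- trial-explanations.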
Artificial neural networks can learn complex, salient data features to achieve a given task. On the opposite end of the spectrum, mathematically grounded methods such as topological data analysis allow users to design analysis pipelines fully aware of data constraints and symmetries. We introduce a class of persistence-based neural network layers. Persistence-based layers allow the users to easily inject knowledge about symmetries (equivariance) respected by the data, are equipped with learnable weights, and can be composed with state-of-the-art neural architectures.
We consider the problem of two active particles in 2D complex flows with the multi-objective goals of minimizing both the dispersion rate and the energy consumption of the pair. We approach the problem by means of Multi-Objective Reinforcement Learning (MORL), combining scalarization techniques with a Q-learning algorithm, for Lagrangian drifters that have variable swimming velocity. We show that MORL is able to find a set of trade-off solutions forming an optimal Pareto frontier. As a benchmark, we show that a set of heuristic strategies is dominated by the MORL solutions. We consider the situation in which the agents cannot update their control variables continuously, but only after a discrete (decision) time, $\tau$. We show that there is a range of decision times, between the Lyapunov time and the continuous updating limit, where Reinforcement Learning finds strategies that significantly improve over heuristics. In particular, we discuss how large decision times require enhanced knowledge of the flow, whereas for smaller $\tau$ all a priori heuristic strategies become Pareto optimal.
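The abstract above combines scalarization with Q-learning but does not state which scalarization is used. The sketch below assumes the simplest choice, a linear weighted sum of two rewards (negative dispersion and negative energy cost), fed into a standard tabular Q-learning update. The states, actions, reward values, and hyperparameters are hypothetical placeholders.

```python
from collections import defaultdict

def scalarize(rewards, weight):
    """Linear scalarization of two objectives (an assumed form)."""
    r_dispersion, r_energy = rewards
    return weight * r_dispersion + (1.0 - weight) * r_energy

def q_update(q, state, action, rewards, next_state, actions, weight,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on the scalarized reward."""
    reward = scalarize(rewards, weight)
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

# Q-table with a default value of 0.0 for unseen (state, action) pairs.
q = defaultdict(float)
actions = ["slow", "fast"]

# One hypothetical transition: swimming fast costs energy (-2.0) and
# still leaves some dispersion (-1.0).
q_update(q, state=0, action="fast", rewards=(-1.0, -2.0),
         next_state=1, actions=actions, weight=0.5)
print(q[(0, "fast")])
```

Training one agent per value of `weight` between 0 and 1 and plotting the two objectives against each other is one way to trace out a set of trade-off solutions like the one the abstract describes.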
Token-free approaches have been successfully applied to a series of word- and span-level tasks. In this work, we compare a byte-level (ByT5) and a wordpiece-based (mT5) sequence-to-sequence model on the 51 languages of the MASSIVE multilingual semantic parsing dataset. We examine multiple experimental settings: (i) zero-shot, (ii) full gold data and (iii) zero-shot with synthetic data. By leveraging a state-of-the-art label projection method for machine-translated examples, we are able to reduce the gap in exact match accuracy to only 5 points with respect to a model trained on gold data from all the languages. We additionally provide insights on the cross-lingual transfer of ByT5 and show how the model compares with mT5 across all parameter sizes.
Artificial neural networks are functions depending on a finite number of parameters typically encoded as weights and biases. The identification of the parameters of the network from finite samples of input-output pairs is often referred to as the \emph{teacher-student model}, and this model has represented a popular framework for understanding training and generalization. Even if the problem is NP-complete in the worst case, a rapidly growing literature -- after adding suitable distributional assumptions -- has established finite sample identification of two-layer networks with a number of neurons $m=\mathcal O(D)$, $D$ being the input dimension. For the range $D<m<D^2$ the problem becomes harder, and truly little is known for networks parametrized by biases as well. This paper fills the gap by providing constructive methods and theoretical guarantees of finite sample identification for such wider shallow networks with biases. Our approach is based on a two-step pipeline: first, we recover the direction of the weights, by exploiting second order information; next, we identify the signs by suitable algebraic evaluations, and we recover the biases by empirical risk minimization via gradient descent. Numerical results demonstrate the effectiveness of our approach.
Parameter-efficient fine-tuning (PEFT) methods can adapt large language models to downstream tasks by training a small amount of newly added parameters. In multi-task settings, PEFT adapters typically train on each task independently, inhibiting transfer across tasks, or on the concatenation of all tasks, which can lead to negative interference. To address this, Polytropon (Ponti et al.) jointly learns an inventory of PEFT adapters and a routing function to share variable-size sets of adapters across tasks. Subsequently, adapters can be re-combined and fine-tuned on novel tasks even with limited data. In this paper, we investigate to what extent the ability to control which adapters are active for each task leads to sample-efficient generalization. Thus, we propose less expressive variants where we perform weighted averaging of the adapters before few-shot adaptation (Poly-mu) instead of learning a routing function. Moreover, we introduce more expressive variants where finer-grained task-adapter allocation is learned through a multi-head routing function (Poly-S). We test these variants on three separate benchmarks for multi-task learning. We find that Poly-S achieves gains on all three (up to 5.3 points on average) over strong baselines, while incurring a negligible additional cost in parameter count. In particular, we find that instruction tuning, where models are fully fine-tuned on natural language instructions for each task, is inferior to modular methods such as Polytropon and our proposed variants.
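The difference between per-task routing and plain averaging in the abstract above can be shown with toy parameter vectors. The sketch below makes several assumptions: each adapter is a short list of floats, a task's routing is a fixed list of non-negative weights rather than a learned function, and the averaging variant uses uniform weights. The `combine` helper, the inventory, and the routing table are hypothetical.

```python
def combine(inventory, weights):
    """Weighted average of adapter parameter vectors."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(inventory[0])
    return [
        sum(w * adapter[i] for w, adapter in zip(weights, inventory)) / total
        for i in range(dim)
    ]

# Hypothetical inventory of three adapters with two parameters each.
inventory = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Per-task routing: each task selects its own subset of adapters.
routing = {"task_a": [1.0, 0.0, 1.0], "task_b": [0.0, 1.0, 0.0]}
routed = {task: combine(inventory, w) for task, w in routing.items()}

# Averaging variant: every adapter contributes equally, with no task-specific routing.
uniform = combine(inventory, [1.0] * len(inventory))

print(routed["task_a"])  # [1.5, 1.0]
print(routed["task_b"])  # [0.0, 1.0]
print(uniform)           # [1.0, 1.0]
```

With routing, the two tasks end up with different combined adapters, while the averaging variant starts every task from the same combined adapter.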